DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
Recall the model equation for the best-fit line is
\[ y_i = \beta_0 + \beta_1 \times x_i + \varepsilon_i, ~ \text{where} ~ \varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon) \]
We want to estimate the values of \(\beta_0\), \(\beta_1\), and \(\sigma_\varepsilon\) from the \(n\) observations in our data, by minimising the sum of squared residuals, \(SSR\)
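The output below comes from fitting this model in R with least squares. A minimal sketch of the call, assuming the data frame is named `crabs.df` with columns `Premolt` and `Postmolt` (as shown in the `Call:` line):

```r
# Fit the best-fit line: Premolt is the response,
# Postmolt is the explanatory variable
crabs.fit <- lm(Premolt ~ Postmolt, data = crabs.df)

# Report the fitted coefficients, their standard errors,
# the residual standard error, and R-squared
summary(crabs.fit)
```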
```
Call:
lm(formula = Premolt ~ Postmolt, data = crabs.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6485 -1.3060  0.0829  1.2683 11.0291 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -28.94743    1.55480  -18.62   <2e-16 ***
Postmolt      1.09948    0.01079  101.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.996 on 359 degrees of freedom
Multiple R-squared:  0.9666, Adjusted R-squared:  0.9665 
F-statistic: 1.039e+04 on 1 and 359 DF,  p-value: < 2.2e-16
```
From this output we can read off:

- the fitted intercept term, \(\widehat{\beta}_0\)
- the fitted gradient term, \(\widehat{\beta}_1\)
- the standard error of the residuals, \(\widehat{\sigma}_\varepsilon\)
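These three quantities can also be pulled out of the fitted model object directly; a sketch, assuming the fit was stored as `crabs.fit`:

```r
# beta0-hat and beta1-hat (about -28.947 and 1.099 here)
coef(crabs.fit)

# sigma-hat, the residual standard error (about 1.996 here)
sigma(crabs.fit)
```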
The best “guess” of \(\beta_1\) according to our data is denoted \(\widehat{\beta}_1\)
By itself, it tells us the direction of the fitted linear relationship and the average change in the response variable as the explanatory variable increases by one unit
We are 95% confident that a one-mm increase in postmolt size increases the average premolt size by somewhere between 1.08 and 1.12 mm
What if we wanted to use the model to infer the average change in premolt size if we increase the postmolt size by five mm?
```
                 2.5 %     97.5 %
(Intercept) -32.005093 -25.889773
Postmolt      1.078267   1.120694
```
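Because the model is linear, the average change for a five-mm increase is just five times the change for a one-mm increase, so the confidence interval scales by five as well. A sketch, assuming the fitted model is stored as `crabs.fit`:

```r
# 95% CI for the average change in premolt size
# per 1 mm increase in postmolt size
ci.1mm <- confint(crabs.fit)["Postmolt", ]

# Scale by 5 for a 5 mm increase in postmolt size:
# roughly 5.39 to 5.60 mm
5 * ci.1mm
```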
A p-value is also reported in the `summary()` output for \(\widehat{\beta}_1\). What null hypothesis is it generating evidence against, and in favour of what alternative?
\[ \begin{aligned} H_0\!&: \beta_1 = 0 \\ H_1\!&: \beta_1 \neq 0 \end{aligned} \]
We have very strong evidence against the null hypothesis that the slope term is equal to 0, in favour of the alternative that it is not (p-value ≈ 0)
The best “guess” of \(\beta_0\) according to our data is denoted \(\widehat{\beta}_0\)
By itself, it tells us the average of the response, \(\widehat{y}_i\), when the value of the explanatory variable, \(x_i\), is equal to 0. That is, it is the value where the best-fit line cuts the y-axis
We estimate with 95% confidence that the mean premolt size is somewhere between -32.0 and -25.9 mm when the postmolt size is 0 mm
In practice, the interpretation of \(\beta_0\) is often not of interest, and we do not mind if it is plausibly equal to 0. However, we still need it to fit the best-fit line to our data!
A p-value is also reported in the `summary()` output for \(\widehat{\beta}_0\). What null hypothesis is it generating evidence against, and in favour of what alternative?
\[ \begin{aligned} H_0\!&: \beta_0 = 0 \\ H_1\!&: \beta_0 \neq 0 \end{aligned} \]
We have very strong evidence against the null hypothesis that the intercept term is equal to 0, in favour of the alternative that it is not (p-value ≈ 0)
Like one-way ANOVA, we can decompose the “variability” of the response variable into the part described by the best-fit line and the part left in the residuals. Hence we can conduct an F-test with the following hypotheses
\[ \begin{aligned} H_0\!&: \text{The model is ineffective} \\ H_1\!&: \text{The model is effective} \end{aligned} \]
We have very strong evidence against the null hypothesis that the fitted model is ineffective in modelling a linear relationship between the premolt and postmolt sizes of crabs, in favour of the alternative that it is effective (p-value ≈ 0)
```
Analysis of Variance Table

Response: Premolt
           Df Sum Sq Mean Sq F value    Pr(>F)    
Postmolt    1  41393   41393   10389 < 2.2e-16 ***
Residuals 359   1430       4                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[1] 42823
```
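The ANOVA table above comes from passing the fitted model to `anova()`, and the final `[1] 42823` is the total sum of squares, which the two rows add up to (41393 + 1430 = 42823). A sketch, assuming the fit is stored as `crabs.fit`:

```r
# Decompose the variability of the response into the part
# described by the line (Postmolt row) and the residuals
crabs.aov <- anova(crabs.fit)
crabs.aov

# Total sum of squares = model SS + residual SS
sum(crabs.aov$`Sum Sq`)
```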
Simulated data based on a 24-hour open-book online exam held at a real university
In one trimester, exams were held over 24 hours, though students could still complete many exams within 2 hours. An exam ran from 1 pm one day to 1 pm the next day. The lecturers collected data from an introductory statistics paper to investigate whether there was any pattern in the exam marks depending on the final submission time within these 24 hours
| Variable | Description |
|---|---|
| Score | An integer denoting the student’s exam mark out of 80 |
| Submit | The number of hours after the exam was released before the student submitted their exam; a number between 0 and 24 |
```
Call:
lm(formula = Score ~ Submit, data = submit.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.063  -9.381  -0.089   9.893  27.935 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57575    0.58546  89.803   <2e-16 ***
Submit      -0.02137    0.03524  -0.607    0.544    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.22 on 2281 degrees of freedom
Multiple R-squared:  0.0001613, Adjusted R-squared:  -0.0002771 
F-statistic: 0.3679 on 1 and 2281 DF,  p-value: 0.5442
```
We can use the `lm()` R function to fit only \(\beta_0\) and \(\varepsilon_i\) to \(y_i\) like so
`Score ~ 1` has a special meaning associated with it, that is, fit a single mean to the response variable…
What is the difference between the two methods…?
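One way to see what `Score ~ 1` does: the single fitted coefficient is just the sample mean of the response. A sketch, assuming the data frame is named `submit.df`:

```r
# Intercept-only model: a single mean plus residual scatter
null.fit <- lm(Score ~ 1, data = submit.df)

# The fitted intercept (about 52.26 here) ...
coef(null.fit)

# ... equals the sample mean of the response
mean(submit.df$Score)
```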
```
Call:
lm(formula = Score ~ 1, data = submit.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.263  -9.263  -0.263   9.737  27.737 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  52.2628     0.2767   188.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.22 on 2282 degrees of freedom
```
```
               2.5 %   97.5 %
(Intercept) 51.72024 52.80539
```
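This confidence interval for the intercept-only model is the familiar one-sample t-interval for a mean, so it can be cross-checked against `t.test()`; a sketch, assuming the intercept-only fit is stored as `null.fit`:

```r
# 95% CI for the mean score from the intercept-only model
confint(null.fit)

# The one-sample t-interval gives the same interval
t.test(submit.df$Score)$conf.int
```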